How the Salesforce Einstein AI Platform Team Improved LLM Latency and Throughput with Amazon SageMaker
Date: 24 JUL 2024
Category: Amazon Machine Learning, Amazon SageMaker, Amazon SageMaker JumpStart, Customer Solutions, Experience-Based Acceleration, Generative AI
This article represents a collaborative effort between Salesforce and AWS, and is being co-published on both the Salesforce Engineering Blog and the AWS Machine Learning Blog. Salesforce, Inc. is a cloud-based software company based in San Francisco, California, specializing in customer relationship management (CRM) software and applications that enhance sales, customer service, marketing automation, e-commerce, analytics, and application development. The company is progressing towards artificial general intelligence (AGI) for business purposes, enabling predictive and generative functions within their flagship software-as-a-service (SaaS) CRM, while also working on intelligent automations utilizing artificial intelligence (AI) and agents.
Salesforce Einstein is a suite of AI technologies that integrates with Salesforce’s Customer Success Platform to boost productivity and improve client engagement. With over 60 features spanning four main areas—machine learning (ML), natural language processing (NLP), computer vision, and automatic speech recognition—Einstein delivers advanced AI capabilities across sales, service, and marketing functions. Notably, it includes out-of-the-box features such as sales email generation in Sales Cloud and service replies in Service Cloud, along with low-code tools such as Copilot Builder, Prompt Builder, and Model Builder, available in Einstein 1 Studio, that organizations can use to create custom AI functionality.
The Salesforce Einstein AI Platform team is dedicated to developing Einstein applications with a focus on improving AI model performance, particularly the large language models (LLMs) used in Einstein products. The team continuously refines its LLMs and AI models by adopting state-of-the-art techniques and collaborating with leading technology providers, including open-source communities and public cloud providers such as AWS, all while building a unified AI platform. This ensures Salesforce customers benefit from the latest advancements in AI technology.
In this article, we explore how the Salesforce Einstein AI Platform team improved the latency and throughput of their code generation LLM using Amazon SageMaker.
Challenges in Hosting LLMs
At the start of 2023, the team began exploring options for hosting CodeGen, Salesforce’s in-house open-source LLM designed for code understanding and generation. The CodeGen model enables users to convert natural language, such as English, into programming languages like Python. Having already utilized AWS for inference of smaller predictive models, they aimed to extend the Einstein platform to host CodeGen. Salesforce developed a suite of CodeGen models tailored for the Apex programming language—Inline for code completion, BlockGen for code block generation, and FlowGPT for process flow generation. Their goal was to find a secure hosting solution capable of managing high volumes of inference requests and multiple concurrent requests at scale, while meeting the throughput and latency demands of their co-pilot application (EinsteinGPT for Developers). This application streamlines development by generating intelligent Apex code from natural language prompts, assisting developers in identifying code vulnerabilities and obtaining real-time coding suggestions within the Salesforce integrated development environment (IDE).
The Einstein team thoroughly evaluated various tools and services, including open-source and commercial options. Ultimately, they determined that SageMaker offered the best access to GPUs, scalability, flexibility, and performance optimizations that addressed their latency and throughput challenges.
Why Salesforce Einstein Chose SageMaker
SageMaker provided several key features essential to fulfilling Salesforce’s requirements:
- Multiple Serving Engines: SageMaker provides specialized deep learning containers (DLCs), including large model inference (LMI) containers, along with libraries and tools for model parallelism. These high-performance Docker containers are purpose-built for LLM inference and support open-source inference libraries such as FasterTransformer, TensorRT-LLM, vLLM, and Transformers NeuronX, providing an all-in-one LLM serving solution. The Einstein team appreciated the quick-start notebooks that enabled rapid deployment of popular open-source models (a deployment sketch follows this list).
- Advanced Batching Strategies: SageMaker LMI containers enable performance optimization of LLMs through batching, which groups multiple requests together before they reach the model. With dynamic batching, the server waits for a configured window of time and combines the requests that arrive within it, up to a maximum of 64 requests, while honoring a configured preferred batch size. This optimizes GPU utilization and balances throughput against latency, ultimately minimizing the latter. The Einstein team used dynamic batching to raise the throughput of their CodeGen models while keeping latency low (see the configuration sketch after this list).
- Efficient Routing Strategy: By default, SageMaker endpoints use a random routing strategy, but they also support a least outstanding requests (LOR) strategy, which routes each request to the instance best suited to serve it by tracking the load on each instance and the models deployed there. The flexibility to choose a routing algorithm per workload, together with the ability to run multiple model instances across several GPUs, ensured that traffic was distributed evenly and bottlenecks were avoided (an endpoint-configuration sketch follows this list).
- Access to High-End GPUs: SageMaker provides access to top-tier GPU instances, which are essential for running LLMs efficiently and were in short supply in the market at the time. The Einstein team relied on the auto-scaling of these GPU instances to meet demand without manual intervention (an auto-scaling sketch appears after this list).
- Rapid Iteration and Deployment: Although this isn’t directly linked to latency, SageMaker’s notebooks allow for quick testing and deployment of changes, shortening the overall development cycle. This acceleration can indirectly enhance latency by speeding up the implementation of performance enhancements.
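To make the first point concrete, here is a minimal deployment sketch using an LMI (DJL-Serving) container with the SageMaker Python SDK. The IAM role ARN, S3 artifact path, container version, endpoint name, and instance type are all placeholder assumptions, and the exact framework label and version accepted by image_uris.retrieve depend on your SageMaker SDK release—this is an illustration, not Salesforce’s actual setup:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

# Placeholder execution role; substitute your own account's values.
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"
session = sagemaker.Session()

# Look up an LMI (DJL-Serving) deep learning container for this region.
# The framework label and version vary by SageMaker SDK release.
image_uri = image_uris.retrieve(
    framework="djl-lmi",
    region=session.boto_region_name,
    version="0.28.0",
)

# Point the container at a packaged model artifact (hypothetical S3 path).
model = Model(
    image_uri=image_uri,
    model_data="s3://my-bucket/codegen/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Deploy to a GPU instance; instance type and count are illustrative.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="codegen-lmi-endpoint",
)
```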
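The dynamic batching behavior described in the second point is controlled through the container’s serving configuration. Below is a minimal serving.properties sketch, assuming DJL-Serving’s batch_size and max_batch_delay keys; the model location and the concrete values are illustrative, not Salesforce’s production settings:

```properties
# Engine and model location (hypothetical values).
engine=Python
option.model_id=s3://my-bucket/codegen/

# Dynamic batching: wait up to 100 ms to group incoming requests,
# combining at most 64 requests into a single batch.
batch_size=64
max_batch_delay=100
```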
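Request routing, the third point, is chosen when the endpoint configuration is created. The sketch below selects the LOR strategy through the low-level boto3 API; the endpoint-config, model, and variant names are hypothetical:

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint configuration selecting least-outstanding-requests routing.
# RoutingStrategy also accepts "RANDOM" (the default behavior).
sm.create_endpoint_config(
    EndpointConfigName="codegen-lor-config",   # hypothetical name
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "codegen-model",      # hypothetical model name
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 2,
            "RoutingConfig": {
                "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
            },
        }
    ],
)
```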
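Finally, the GPU auto-scaling mentioned in the fourth point is driven by Application Auto Scaling against the endpoint variant. A minimal sketch follows, with illustrative resource names, capacity bounds, and target value (not Salesforce’s actual policy):

```python
import boto3

aas = boto3.client("application-autoscaling")

# The scalable target is the endpoint variant (names are hypothetical).
resource_id = "endpoint/codegen-lmi-endpoint/variant/AllTraffic"

# Allow the variant to scale between 1 and 4 instances.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance; scale out quickly, scale in slowly.
aas.put_scaling_policy(
    PolicyName="codegen-invocation-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```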
These capabilities collectively optimize LLM performance by reducing latency and improving throughput, making Amazon SageMaker a robust choice for Salesforce’s needs.